3 research outputs found

    On developing an automatic threshold applied to feature selection ensembles

    © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article "B. Seijo-Pardo, V. Bolón-Canedo, and A. Alonso-Betanzos, «On developing an automatic threshold applied to feature selection ensembles», Information Fusion, vol. 45, pp. 227-245, Jan. 2019" has been accepted for publication in Information Fusion. The Version of Record is available online at https://doi.org/10.1016/j.inffus.2018.02.007
    [Abstract]: Feature selection ensemble methods are a recent approach that aims to add diversity to the sets of selected features, improving performance and obtaining more robust and stable results. However, using an ensemble introduces the need for an aggregation step to combine the outputs of the methods that form the ensemble. Besides, when trying to improve computational efficiency, ranking methods that order all initial features are preferred, so an additional thresholding step is also mandatory. In this work, two different ensemble designs based on ranking methods are described. The main difference between them is the order in which the combination and thresholding steps are performed. In addition, a new automatic threshold based on the combination of three data complexity measures is proposed and compared with traditional thresholding approaches based on retaining a fixed percentage of features.
The behavior of these methods was tested, according to SVM classification accuracy, with satisfactory results for three different scenarios: synthetic datasets and two types of real datasets (where sample size is much higher than feature size, and where feature size is much higher than sample size). This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN 2015-65069-C2-1-R), by the Xunta de Galicia (research projects GRC2014/035 and the Centro Singular de Investigación de Galicia, accreditation 2016-2019) and by the European Union (FEDER/ERDF).
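The two ensemble designs described above differ only in the order of the combination and thresholding steps. A minimal sketch of that difference, under illustrative assumptions (mean-rank combination, majority voting, and a fixed 50% cut stand in for the paper's actual choices):

```python
import numpy as np

# Each base ranker returns a ranking: position i holds the rank of
# feature i (0 = most relevant). All names here are illustrative.

def combine_then_threshold(rankings, keep_ratio=0.5):
    """Design 1: average the rankings first, then cut at the threshold."""
    mean_rank = np.mean(rankings, axis=0)        # combine by mean rank
    n_keep = max(1, int(rankings.shape[1] * keep_ratio))
    return set(map(int, np.argsort(mean_rank)[:n_keep]))

def threshold_then_combine(rankings, keep_ratio=0.5):
    """Design 2: threshold each ranking, then combine by majority voting."""
    n_feat = rankings.shape[1]
    n_keep = max(1, int(n_feat * keep_ratio))
    votes = np.zeros(n_feat, dtype=int)
    for r in rankings:
        votes[np.argsort(r)[:n_keep]] += 1       # each ranker votes
    # keep features selected by more than half of the rankers
    return set(map(int, np.flatnonzero(votes > len(rankings) / 2)))

# Three toy rankings over five features (feature 0 is best almost everywhere).
rankings = np.array([[0, 1, 2, 3, 4],
                     [0, 2, 1, 4, 3],
                     [1, 0, 2, 3, 4]])
print(combine_then_threshold(rankings))  # {0, 1}
print(threshold_then_combine(rankings))  # {0, 1}
```

On this toy input both designs agree, but they can diverge: a feature ranked highly by a minority of rankers may survive mean-rank combination yet lose a majority vote, which is precisely why the order of the two steps matters.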

    Teachers who have never stopped being students. Challenges and experiences in two different settings: online vs face-to-face

    In this work, we describe our first teaching experience in two different settings: a face-to-face course in the Computer Science Degree of the Universidade da Coruña and an online course in the Research Master's Degree in Artificial Intelligence of the Menéndez Pelayo International University. The experience of teaching both courses simultaneously has allowed us to learn the differences between these two modes of teaching. We want to show how we solved the challenges posed by these two courses, with the aim that the reader can benefit from our brief but intense teaching adventures.

    Information Fusion and Ensembles in Machine Learning

    Programa Oficial de Doutoramento en Computación. 5009V01
    [Abstract] Traditionally, machine learning methods have used a single learning model to solve a particular problem. However, the idea of combining multiple models instead of a single one has its rationale in the old proverb "Two heads are better than one". This approach constructs a set of hypotheses using several different models, which are then combined in order to obtain better performance than learning just one hypothesis with a single method. Several studies have shown that these combined models usually obtain better accuracy than individual methods, due to the diversity of the approaches and the control of the variance, taking advantage of the strengths of the individual methods while overcoming their weak points. These combinations of models are called "committees", or more recently "ensembles". Ensemble learning algorithms have reached great popularity in the machine learning literature, as they achieve performances that were not possible some years ago, and thus have become a "winning horse" in many applications. Moreover, during the last years, the size of the datasets used in machine learning has grown considerably. Thus, dimensionality reduction has become a must in almost any case, and among these preprocessing methods, feature selection (FS) has become an essential preprocessing step for many data mining applications, eliminating irrelevant and redundant information, and thus reducing storage requirements and the computational time needed by the machine learning algorithms. Also, several studies have demonstrated that feature selection can greatly improve the performance of posterior classification methods.
One of the main points addressed in this thesis is the application of the ensemble learning idea to the feature selection process, with the aim of introducing diversity and increasing the regularity of the process. Regularity is the ability of the ensemble approach to obtain acceptable results regardless of the dataset under study and its particular properties. It should also be mentioned that using ensemble approaches has the added benefit of releasing the user from the task of selecting the most adequate method for each dataset, and thus from the obligation of knowing technical details about the existing algorithms. In this way, more user-friendly FS methods are coming onto the scene. Ensembles for feature selection are a recent proposal, and not many works can be found in the literature. There are several steps that need to be confronted when creating an ensemble for FS:
    1. Create a set of different feature selectors, each one providing its output. In order to create diversity, several methods can be used, such as using different samples of the training dataset, using different feature selection methods, or a combination of both.
    2. Aggregate the results obtained by the single models. Several measures can be used in this step, such as majority voting, weighted voting, etc. It is important to choose an adequate aggregation method, one that is able to preserve the diversity of the individual base models while maintaining accuracy.
    In this thesis, we have designed several approaches for the first step: (i) a homogeneous approach, that is, using the same feature selection method with different training data and distributing the dataset over several nodes (or several partitions); and (ii) a heterogeneous approach, i.e., using different feature selection methods with the same training data. Regarding the second step, we have also studied different methods for combining the results obtained from the individual methods.
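The homogeneous design above can be sketched briefly. This is not the thesis code: the base ranker (absolute Pearson correlation with the class), the partition count, and the mean-rank aggregation are all illustrative assumptions.

```python
import numpy as np

def correlation_ranker(X, y):
    """Rank features by |correlation| with the target (0 = most relevant).
    An illustrative stand-in for any univariate ranking method."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(-scores)            # best feature first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))   # invert: rank held by each feature
    return ranks

def homogeneous_ensemble(X, y, n_partitions=3, seed=0):
    """Same ranker on disjoint partitions of the data; aggregate by mean rank."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    parts = np.array_split(idx, n_partitions)      # disjoint partitions
    rankings = [correlation_ranker(X[p], y[p]) for p in parts]
    return np.argsort(np.mean(rankings, axis=0))   # final feature ordering

# Toy data: feature 0 is the class plus a little noise, the rest are noise.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, 90)
X = rng.normal(size=(90, 4))
X[:, 0] = y + 0.1 * rng.normal(size=90)
print(homogeneous_ensemble(X, y))  # feature 0 should come first
```

A heterogeneous ensemble would keep the aggregation step unchanged and instead build `rankings` by applying different ranking methods to the same training data.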
Besides, when the chosen individual selectors are rankers, at some point we need to establish a threshold to retain only the relevant features and to combine the rankings obtained by the different methods that make up the ensemble. In this sense, we have analyzed two different proposals, depending on whether thresholding is performed before or after combination. Finally, a third novelty of this work is related to the need to establish an adequate threshold, and thus we propose a methodology for setting automatic thresholds based on measures of data complexity. The adequacy of the methods proposed throughout this thesis was checked in order to extract a series of final conclusions. To this end, a variety of datasets of different types was used: synthetic, real "classical" (more samples than features) and real DNA microarray datasets (more features than samples). In a first step, synthetic datasets were used to perform the initial tests and check the performance of the newly implemented methods. In a second step, real datasets (both classical and microarray) were used to check the adequacy of the new methods for real-world problems, allowing us to carry out a performance comparison and extract a series of final conclusions. Finally, nowadays it is common to find missing data in the real-world problems that the proposed feature selection ensembles, as any other machine learning method, are likely to face. Traditionally, the common way to deal with this situation was to delete those samples containing missing data, but this is not possible when the percentage of missing data is high, and thus imputation has become the common approach. However, imputation before FS can lead to false positives: features that are not associated with the target become dependent as a result of imputation.
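The thesis sets the automatic threshold from a combination of three data complexity measures. As a rough illustration only, the sketch below uses a single measure, Fisher's discriminant ratio (often called F1), and an assumed keep rule ("score above the mean score") in place of the actual methodology:

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher discriminant ratio for a two-class problem."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    return num / den

def automatic_threshold(X, y):
    """Keep the features whose discriminative power exceeds the average.
    A data-driven cut point, unlike retaining a fixed percentage."""
    scores = fisher_ratio(X, y)
    return set(map(int, np.flatnonzero(scores > scores.mean())))

# Toy data: only feature 0 separates the two classes.
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 0] += 2 * y
print(automatic_threshold(X, y))  # {0}
```

The point of a complexity-driven threshold is that the number of retained features adapts to how separable the dataset is, instead of being fixed in advance for every problem.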
In this exploratory work, we use causal graphs to illustrate the notion of structural bias, and we develop a modified t-statistic test to analyze the possible bias that can originate. Our conclusion is that it is more advisable to devise feature selection methods that are "robust" to the presence of missing data than to impute them. In this regard, the development of ensemble feature selection in this scenario remains as a future line to pursue.